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Abstract 

Background: Aberrant activation of signaling pathways downstream of epidermal growth factor receptor (EGFR) 
has been hypothesized to be one of the mechanisms of cetuximab (a monoclonal antibody against EGFR) 
resistance in head and neck squamous cell carcinoma (HNSCC). To infer relevant and specific pathway activation 
downstream of EGFR from gene expression in HNSCC, we generated gene expression signatures using immortalized 
keratinocytes (HaCaT) subjected to ligand stimulation and transfected with EGFR, RELA/p65, or HRAS Val12D . 

Results: The gene expression patterns that distinguished the HaCaT variants and conditions were inferred using the 
Markov chain Monte Carlo (MCMC) matrix factorization algorithm Coordinated Gene Activity in Pattern Sets 
(CoGAPS). This approach inferred gene expression signatures with greater relevance to cell signaling pathway 
activation than the expression signatures inferred with standard linear models. Furthermore, the pathway signature 
generated using HaCaT-HRAS Val12D further associated with the cetuximab treatment response in isogenic 
cetuximab-sensitive (UMSCC1) and -resistant (1CC8) cell lines. 

Conclusions: Our data suggest that the CoGAPS algorithm can generate gene expression signatures that are 
pertinent to downstream effects of receptor signaling pathway activation and potentially be useful in modeling 
resistance mechanisms to targeted therapies. 



Background 

Aberrant signal transduction pathways induce and main- 
tain many cancers [1-3]. Therefore, targeted therapeutics 
blocking this aberrant cellular signaling activity can im- 
pede malignant progression. However, the clinical suc- 
cess of targeted agents relies on accurate identification of 
the relative contribution of the targeted pathway to the 
malignancy prior to treatment. For example, EGFR over- 
expression is associated with poor prognosis of head and 
neck squamous cell carcinoma (HNSCC) leading to 
adoption of EGFR-targeted agents including cetuximab, 
a monoclonal antibody against EGFR, for clinical 
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management [4-7]. Although many HNSCC patients 
benefit from cetuximab treatment, a majority of patients 
do not respond or eventually develop acquired resistance 
after the initial clinical response. Previous studies have 
implicated activation of epistatic signaling intermediaries 
downstream of EGFR activation in cetuximab resistance 
[8]. Thus, inference of aberrant pathway activity con- 
trolled by EGFR activation may shed light on molecular 
underpinnings of acquired cetuximab resistance in 
patients with HNSCC. 

Gene expression profiles constitute an important tool 
to investigate and predict biochemical network activity in 
complex cellular systems such as human tumors. Stand- 
ard class discovery techniques, such as hierarchical clus- 
tering [9] and singular value decomposition (SVD; [10]), 
can implicate gene expression activity common to sub- 
sets of samples. However, clustering algorithms are 
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unable to account for the reuse of genes in diverse bio- 
logical processes as is common in eukaryotic systems 
[11,12]. Moreover, algorithms such as SVD infer complex 
combinations of responses across all measured genes, 
without regard for the biochemical structure of cellular 
signaling networks. As a result, these inference techni- 
ques often obscure the relationship of the resulting gen- 
etic signatures to the specific cell signaling processes 
that are active in the measured system [13]. 

On the other hand, coordinated expression changes in 
a priori sets of downstream targets of pathway-activated 
transcription factors can implicate activity in specific sig- 
naling pathways [14-16]. These pathway-inference tech- 
niques predominately use statistical techniques that 
quantify the magnitude of class comparison statistics, not- 
ably t-statistics, in a priori gene sets relating to cell signal- 
ing relative to a background, null distribution (reviewed in 
[17]). Similar to clustering algorithms, co-regulation of in- 
dividual genes by multiple pathways and transcription fac- 
tors will bias these gene set-based analysis techniques [18]. 
Previous studies have shown that Markov chain Monte 
Carlo (MCMC) matrix factorization techniques, such as 
Bayesian Decomposition (BD; [11]) and Coordinated Gene 
Activity in Pattern Sets (CoGAPS; [19]), robustly infer 
gene expression patterns relating to transcription factor 
activity in cancers [13]. 

To generate pathway signatures for HNSCC that are 
well characterized at the molecular level, we use HaCaT 
keratinocytes and isogenic variants thereof. HaCaT cells 
are well characterized in reference to their biological and 
malignant properties [20-23]. Therefore, these HaCaT 
models provide a potential to identify oncogenic signa- 
tures related to signaling responses that are unencum- 
bered by 'background noise' inherent to tumor tissues. 
The HaCaT cell lines have the further advantage as a 
model system because their genetic aberrations closely 
parallel early oncogenic events seen in HNSCC. Specific- 
ally, like HNSCC [24], the HaCaT lines represent a spon- 
taneously immortalized aneuploid human keratinocyte 
cell line of monoclonal origin with increased telomerase 
activity [25], two mutant p53 alleles [26], and absence of 
pl6 a expression due to promoter hypermethylation 
[27]. 

Here we show that the MCMC matrix factorization al- 
gorithm CoGAPS infers gene expression patterns in 
HaCaT keratinocytes associated with modulation of the 
EGFR activity by forced expression, ligand stimulation, 
and pharmacological inhibition [28]. We relate these pat- 
terns directly to activity at the pathway level, in contrast 
to previous studies that have focused on transcription 
factor-level activity [13,29]. In addition, the CoGAPS in- 
ferred patterns predict molecular response to cetuximab 
treatment in an isogenic pair of HNSCC cell lines with 
varying cetuximab sensitivities. 



Results 

Compilation of protein-protein interaction data in the 
signaling pathways downstream of EGFR in HNSCC for 
pathway-level gene set inference 

A number of molecular interactions triggered by EGFR 
activation have been characterized in HNSCC. Figure 1 
summarizes the corresponding pathway diagram of ca- 
nonical protein-protein interactions in EGFR signaling in 
HNSCC compiled from reviews by [30,31]. This diagram 
displays that many of the EGFR-dependent signaling 
events culminate in activation of distinct subsets of tran- 
scription factors (Additional file 1: Additional file 6: 
Table SI). Although not directly downstream of EGFR, 
Addional file 6: Table SI also shows transcription factors 
activated by the Notch and TGF-[3 pathways. These path- 
ways are included in our analyses because they intersect 
with EGFR-dependent signaling events and have previ- 
ously been implicated in HNSCC biology [32-35]. The 
pathway-activated transcription factors and associated 
targets listed in Additional file 6: Table SI were used to 
infer specific pathway activity from Affymetrix micro- 
array measurements of HNSCC using the gene set statis- 
tics from the CoGAPS algorithm (eqs. 2 and 3) and from 
linear models [36] as described in the methods section. 

CoGAPS reveals transcriptional responses to EGFR 
pathway modulation in the isogenic HaCaT model system 
reflected in their gene expression changes 

Isogenic HaCaT variants overexpressing either wild- 
type EGFR or the transcription factor NF-kappa-B p65 
subunit (p65), or mutant HRAS have previously been 
described by [23,28,37,38]. In order to delineate specific 
targets resulting from activating these pathways, we mea- 
sured gene expression patterns associated with increased 
expression of wild-type EGFR (HaCaT-EGFR WT ) and 
p65 (HaCaT-p65 WT ) with and without serum and ligand 
stimulation (EGF and TNF-oc, for 4 h or 8 h) and from 
constitutive activation of HRAS imparted by an activat- 
ing mutation (HaCaT- HRAS Vall2D ) concurrently with the 
pathway modulation demonstrated in [28]. Table 1 fur- 
ther summarizes the experimental design for these gene 
expression data. 

CoGAPS infers the relative presence of patterns in 
gene expression (columns of the matrix A) across sam- 
ples measured (rows of P) from a matrix of microarray 
data D with N genes (rows) and M samples (columns). 
Specifically, it uses the MCMC algorithm described in 
the Methods section to find the optimal, sparse non- 
negative matrices A and P according to the equation 

D~7V(AP,E) (1) 

where N represents the normal distribution on each 
element of the matrix multiplication AP and the i th row 
and j th column of the matrix E is the standard deviation 
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of gene expression for the i l gene and f sample. The 
resulting signals that are common to subsets of the sam- 
ples are summarized in the rows of the pattern matrix P, 
with related gene expression patterns in the correspond- 
ing columns of A. In contrast to standard techniques 
class-comparison algorithms that strictly infer differential 
expression between sets of samples, the CoGAPS ana- 
lysis infers predominant signals from the gene expression 
data in an unsupervised analysis. As a result, CoGAPS 
can capture degrees of gene expression activity that are 
common to various sample classes. A further advantage 
of CoGAPS over standard class-comparison or clustering 
algorithms is its ability to infer activity in subsets of 
genes concurrently affected by the diverse biological pro- 
cesses in each sample type [12,19,29]. 

When decomposing the data from the HaCaT model 
system according to eq. 1, CoGAPS infers six patterns. 
These patterns are robust in that they are conserved 



Table 1 HaCaT experimental design 
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Summary of the experimental conditions in the HaCaT cell line expression 
experiments, provided in further detail in Supplemental File 5. 



across three separate simulations of the CoGAPS algo- 
rithm (Additional file 1: Figure SI). Therefore, all subse- 
quent analyses are performed for the average patterns 
inferred from these simulations. Although the CoGAPS 
analysis is unsupervised, the six inferred patterns plotted 
in Figure 2 separate samples based on the experimental 
conditions summarized in Table 1. Specifically, CoGAPS 
infers patterns that relate to the molecular changes intro- 
duced in the isogenic HaCaT model system; EGFR WT , 
pSS^, HRAS Vall2D and empty vector control. Likewise, 
CoGAPS infers an additional pattern that reflects path- 
way activity resulting from the exposure of cell lines 
to fetal calf serum. When serum starved, both HaCaT- 
EGFR WT and HaCaT-p65 WT show weak upregulation in 
the pattern associated with the control HaCaT-vector 
(Figure 2c; one sided p-value of 0.001 for HaCaT- 
EGFR WT and 0.013 for HaCaT-p65 WT ). Moreover, in- 
corporating the EGF ligand in the media for HaCaT- 
EGFR WT (Figure 2d) and the TNF-ct ligand for HaCaT- 
p65 WT (Figure 2f) moderately increases the association 
of both cell lines to their respective patterns over their 
serum starved counterparts (one-sided p-value of 0.002 
for HaCaT-EGFR WT and 0.065 for HaCaT-p65 WT ). 

In accordance with eq. 1, the inferred patterns from 
the rows of P can be linked to signatures of gene expres- 
sion in the columns of the amplitude matrix A to impli- 
cate relative gene expression in subsets of samples. We 
apply the Z-score gene-set statistic of [13] (eqs. 2 and 3 
of the Methods section) to infer pattern-specific pathway 
activation (repression) from over- (under-) representa- 
tion of large magnitude elements in the A matrix for 
genes that are targets of transcription factors. Whereas 
[13] limited application of this statistic to targets of tran- 
scription factors, Figure 3 displays estimates of activation 
and repression of pathways and Additional file 2: Figure 
S2 the statistics for component transcription factors 
(listed in Additional file 1: Table SI) estimated with eq. 3 
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Figure 2 CoGAPS inferred gene expression patterns. Box plot of six gene expression patterns inferred from the HaCaT gene expression data 
for samples in Table 1. Plotted values are normalized to sum to one across all samples and are averaged across three applications of the CoGAPS 
algorithm. All results for HaCaT-EGFR m are colored in green, HaCaT-HRAS Val,2D in red, HaCaT-EGFR™ 1 in grey, HaCaT-peS™ 1 in blue, and 
HaCaT-vector in black. The y-axis is labeled according to the row of the inferred P matrix plotted in each panel. Specifically, (a) contains the pattern 
attributed to the baseline HaCaT activity, (b) attributed to HaCaT-HRAS Vall2D , (c) HaCaT-vector control, (d) HaCaT-EGFR m (e) serum, and (f) HaQT-p65 wr . 



for the HaCaT data. Therefore, in this application a path- 
way is inferred to be either up or down regulated if its 
targets have a correspondingly larger or smaller Z-score 
in elements of the A matrix relative to a random selec- 
tion of genes. By basing this statistic on the relative 
standard-deviation adjusted magnitude in elements of A, 
CoGAPS directly infers pathway activity that is active in 
each pattern inferred (row of P). Therefore, CoGAPS 
provides a direct implication of relative pathway activity 
in samples in contrast to the inference limited to differ- 
ential pathway activity in gene set statistics in standard 
class-comparison algorithms. Similar to the inferred pat- 
terns, these pathway and transcription factor level statis- 
tics are robust across CoGAPS simulations (Figure 3 and 
S3). 

In analyzing the CoGAPS estimated pathway activity 
(Figure 3), we observe the expected global upregulation 
of STAT, RAS, and AKT pathways in HaCaT-EGFR WT , 



upregulation of RAS in HaCaT- HRAS Vall2D , and upregu- 
lation of AKT resulting in HaCaT-p65 WT . We also ob- 
serve unexpected weak activation of TGF-P in HaCaT- 
EGFR WT and of RAS and TGF-|3 in HaCaT-p65 WT . 
These unexpected signals are likely due to pathway cross 
talk and the observed weak upregulation of RAS and 
TGF-|3 pathways in HaCaT-vector control. The presence 
of serum also enhances this upregulation of RAS and 
TGF-|3. The final pattern is consistent across samples 
and reveals a global upregulation of Notch and STAT 
pathways in the HaCaT cell lines. 

Comparison of gene expression signatures inferred in 
CoGAPS and linear models 

We note that the six CoGAPS patterns are inferred 
without any prior information about the experimental 
conditions in Table 1, whereas standard linear models 
of pathway response require information about the 
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Figure 3 HaCaT expression CoGAPS gene set statistics. Gene set 
statistics for pathway activity of the HaCaT expression data 
calculated from eq. 3 averaged across three CoGAPS simulations. 
Error bars indicate minimum and maximum values of the pathway 
activity statistic in each of the three CoGAPS simulations. Bars are 
grouped and labeled according to the dominant experimental 
condition to which inferred CoGAPS patterns correspond. Within each 
block, statistics are provided for the STAT, RAS, AKT, TGFB, and Notch 
pathways (left to right) colored according to the figure legend. 



pertinent experimental conditions. As a result, linear 
models formed based upon the experimental conditions 
in the HaCaT forced expression experiments may overfit 
expression changes due to treatment and stimulation con- 
ditions. Moreover, they would be unable to account for 
relative changes in the magnitude of expression across 
comparable samples encoded in the relative magnitude of 
rows of P (Figure 2). 

For comparison, we formulate a linear model that 
accounts for the forced overexpression conditions and 
presence or absence of serum based upon the inferred 
CoGAPS patterns. In Figure 4a, we observe that the ex- 
pression patterns inferred for the linear model across all 
HaCaT samples (baseline), all serum treated samples in 
HaCaT, and all HaCaT-HRAS Vall2D samples are most simi- 
lar to the corresponding CoGAPS patterns. However, the 
patterns associated with known molecular variables in 
HaCaT cells cluster separately from the CoGAPS esti- 
mated patterns. These patterns also show a larger differ- 
ence in their clustering than the patterns of molecular 
variables inferred from CoGAPS. 

Figure 4b and 4c show the pathway statistics from the 
linear model gene expression signatures using standard 



gene set tests and a test analogous to CoGAPS, respect- 
ively, as described in the Methods section. In contrast to 
CoGAPS (Figure 3), the gene set statistics from the linear 
model (Figure 4b) show strong upregulation of only the 
AKT and Notch pathways in both the EGFR and RAS 
pathways, with only weak upregulation of RAS, which 
would be predicted from the experimental conditions. 
While the linear model reveals expected upregulation of 
the AKT pathway in HaCaT-p65 , it also infers unex- 
pected upregulation of Notch and strong upregulation of 
the STAT pathway. This latter STAT upregulation due to 
p65 overexpression is inconsistent with the structure of 
EGFR protein-protein interactions (Figure 1), in which 
p65 is far downstream of STAT. In further contrast to 
the CoGAPS gene expression patterns, the linear model 
does not infer expected STAT pathway upregulation in 
the HaCaT-HRAS Vall2D or HaCaT-EGFR WT data. Similar 
to CoGAPS, the linear model infers a strong upregula- 
tion of TGF-|3 from the introduction of serum. Unlike 
CoGAPS, the linear model also predicts that serum 
strongly upregulates signaling in the RAS pathway. The 
signaling patterns are largely similar using the CoGAPS - 
based permutation statistics from eq. 4 in Figure 4c. 
In this case, the linear model statistics do infer a weak 
upregulation of RAS in HaCaT-HRAS Vall2D . However, 
unlike the gene expression patterns from CoGAPS, the 
linear model for HaCaT-EGFR^^ still does not infer the 
expected upregulation of the STAT or RAS pathways. 

Gene expression signatures from CoGAPS distinguish 
pathway-level response in isogenic cetuximab sensitive 
and resistant HNSCC cell lines 

We project the three HaCaT signaling patterns averaged 
across three CoGAPS simulations and associated with 
the HaCaT forced over-expression (HaCaT-HRAS Vall2D , 
HaCaT-EGFR WT , and HaCaT-p65 WT ) onto the gene ex- 
pression data for HNSCC cell lines UMSCC1 and 1CC8. 
As previously described, this pair of cell lines are iso- 
genic sensitive and resistant HNSCC lines, respectively [8] . 
Thus, we can infer whether the signaling-related gene 
expression signatures elucidate the molecular mecha- 
nisms underlying cetuximab resistance in this pair of cell 
lines to potentially model cetuximab resistance in HNSCC. 
Figure 5 summarizes the relative CoGAPS inferred signa- 
ture activity in UMSCC1 with and without cetuximab 
treatment, and 1CC8 with and without cetuximab treat- 
ment. The signatures associated with modulation of 
EGFR WT , pGS^ 1 and HRAS Vall2D are all significantly larger 
in 1CC8 than UMSCC1 (p-value of 7xl0~ 4 for the HaCaT- 
HRAS Vall2D signature, 8x10 s for the HaCaT-EGFR WT sig- 
nature, and 5xl0~ 3 for HaCaT-p65 WT signature). Treatment 
with cetuximab in the sensitive cell line UMSCC1 further 
reduces the activity in HaCaT-HRAS Vall2D (p-value 0.042) 
and HaCaT-p65 wl (p-value 0.004), but upregulated the 
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Figure 4 Comparison of CoGAPS and linear models, (a) Clustering diagram to compare the gene signatures in the columns of A from eq. 1 in 
each of three CoGAPS simulations (simulation 1 in red, simulation 2 in green, and simulation 3 in blue) to the signatures inferred from a limma 
inear model using the same design matrix (black), (b) and (c) display gene set statistics for pathway activity resulting from the limma estimated 
inear model for STAT, RAS, AKT, TGFB, and Notch (left to right within each group) colored according to the bottom figure legend, (b) Uses the 
standard gene set test from limma, rescaled according to eq. 4. (c) Uses the permutation-based gene set statistic adapted for a linear model from 
eqs. 2 and 3 described in the methods section. 



signature associated with HaCaT-EGFR (p-value 0.13). 
A similar trend is observed for 1CC8 (p-value of 0.007 for 
HaCaT-EGFR WT and 0.0008 for HaCaT-p65 WT ). However, 
whereas treatment with cetuximab reduces the HaCaT- 
HRAS VaI12D signature in the sensitive cell line UMSCC1, it 
has no effect on treatment in the resistant cell line 1CC8 
(p-value of 0.23). 

Discussion 

CoGAPS demonstrated improved ability to correctly 
infer patterns with even subtle transcriptional differences 



because this algorithm can accurately account for gene 
re-use [12,19,39]. In contrast to linear models, CoGAPS 
seeks gene expression signatures with minimal structure. 
As a result, the signatures inferred with CoGAPS are 
more similar than those inferred with linear models. 
Therefore, these CoGAPS expression signatures reveal 
degrees of activation of the pathways downstream of 
EGFR due to pathway cross-talk, unlike the standard 
gene set test or permutation gene set analysis for linear 
models. For example, only the CoGAPS analysis can de- 
tect the modest RAS and STAT signals associated with 
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Figure 5 CoGAPS signature in UMSCC1 and 1CC8 cell lines. Box 

plot of the P matrix resulting from projecting the CoGAPS inferred 
gene expression signatures for (a) HaCaT-HRAS Val12D , 
(b) HaCaT-EGFR™ 1 , and (c) HaCaT-peS™ 1 onto gene expression data 
for UMSCC1 (left group) and 1CC8 cell lines (right group). Results for 
data treated with cetuximab are colored in red (right in each group) 
and untreated are colored in blue (left in each group). 



forced expression of EGFR and HRAS in HaCaT cells. 
Moreover, the sparsity in the CoGAPS prior ensures that 
gene expression amplitude is assigned only to pertinent 
patterns. As a result, while the linear model assigns 
Notch activity to each of the forced expression signa- 
tures, CoGAPS correctly assigns this global signature to 
the background term. 

This type of statistical modeling can delineate the key 
pathway activity of relevance to the development of tar- 
geted agents in genetically heterogeneous HNSCC. In 
our previous molecular characterization of HNSCC 
based on gene expression profiling, we have shown that 
there are at least four subtypes of HNSCC and these 
subtypes have prognostic implications reflecting the bio- 
logical heterogeneity [40]. In addition, recent completion 
of whole exome sequencing of HNSCC has shown that 
tobacco-related HNSCC contain average of 20.6 muta- 
tions per tumor and many of the mutated genes are 
tumor suppressors that cannot be easily targeted [34,35]. 
Furthermore, many of these mutated genes are regulated 
in contextual manner meaning that one mutation of a 
gene in a tumor may result in a different phenotype de- 
pending on the genetic context determined by other co- 



existing mutations or deregulations. The analyses of 
global gene expression signature will potentially yield the 
dominant pathways that result from lack of tumor sup- 
pressor or contextual regulated genes, and allow exploit- 
ation as therapeutic targets or biomarkers of clinical 
outcome. It is also a potentially valuable tool of deter- 
mining the on- and off-target effects of targeted agents 
and hypothesis generating approach to unravel the 
mechanism of resistance by unbiased examination of 
global changes in the inter-connected pathways induced 
by inhibition of the targeted pathway. Due to the small 
sample size, we could not infer accurate expression sig- 
natures from the pharmacological inhibition of EGFR, 
MEK or PI3K in our model system. However, additional 
studies to optimize the experimental design and to fur- 
ther validate the model for utilization in experimental 
therapeutics are in progress. 

In this study, the inferred CoGAPS gene expression 
signatures implicate signaling processes in the HNSCC 
system that they model. For example, the gene expres- 
sion signature related to constitutive activation of the 
RAS pathway in the HaCaT-HRAS Vall2D distinguishes 
the transcriptional profile of UMSCC1 and 1CC8 and 
their relative transcriptional response to cetuximab treat- 
ment. This observation suggests that over-activation of 
the RAS pathway cannot be repressed by cetuximab in 
resistant HNSCC cell lines. While HRAS mutations are 
found in HNSCC [34], the UMSCC1 and 1CC8 cell lines 
are HRAS wild type. Therefore, this aberrant activity in 
the RAS pathway that our algorithm inferred for 1CC8 is 
consistent with activation of the wild type RAS pathway 
in cetuximab resistance. One possible mechanism for 
this activation would be the compensatory pathway ac- 
tivity proposed in [8]. Further studies to validate the 
mechanisms underlying these are currently ongoing. 



Conclusions 

This work demonstrates the versatility of the CoGAPS 
matrix factorization algorithm to infer biological signal- 
ing nodes and intermediaries as they relate to specific 
gene expression in immortalized HaCaT and transformed 
variants thereof. For example, upon stimulation/deregu- 
lation of EGFR activity, the algorithm successfully identi- 
fied gene expression signatures consistent with known 
elements of the EGFR signaling network (Figure 1). In 
contrast to linear models, the CoGAPS algorithm per- 
forms pathway inference without a priori knowledge of 
the experimental conditions listed in Table 1. Pathway 
inference by CoGAPS was predictive across a heteroge- 
neous set of experimental pathway manipulations, i.e., 
transcriptional responses triggered either by overexpres- 
sion of EGFR, NF-kB/p65 or mutant HRAS, or those 
induced by addition of serum to culture media. 
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Similarly to the UMSCC1/1CC8 model, we hypothesize 
that the pathway- level gene expression signatures in- 
ferred from the HaCaT model with CoGAPS will impli- 
cate relevant molecular mechanisms in gene expression 
data from HNSCC patients in future studies. For example, 
the gene expression signature relating to constitutive RAS 
activation estimated from the HaCaT-HRAS Vall2D provides 
a potential biomarker to infer patient-specific cetuximab 
sensitivity and resistance prior to clinical treatment. More- 
over, previous studies have also linked cetuximab resist- 
ance to increased AKT pathway activity in HNSCC cell 
lines with cetuximab resistance [8]. We hypothesize that 
application of the CoGAPS algorithm to future samples of 
HaCaT with forced expression of RAF and PI3K with 
activating mutations will distinguish whether the activa- 
tion of the RAS pathway induces AKT pathway activity 
preferentially over MAPK pathway (Figure 1) in cetuximab 
resistance. When applied to tumor data, these additional 
CoGAPS inferred gene expression signatures would also 
provide candidate molecular targets for patients with pre- 
dicted cetuximab resistance. 

Methods 

Inhibition and stimulation of EGFR signaling pathway in 
the HaCaT model system and HNSCC cell lines 

HaCaT cells overexpressing EGFR (HaCaT-EGFR) were 
maintained in cell culture media (W489) as previously 
described [20,22]. The cells were seeded in 100 mm tis- 
sue culture plates in regular media until they reach 70- 
80 % confluency. After incubation with serum-free media 
for 12 hours, EGF or TNF-a (10 ng/ml, Sigma-Aldrich, 
St. Louis, MO) were then added for 4 or 8 hours before 
cells were harvested for total RNA isolation. The culture 
conditions, total RNA isolation and microarray experi- 
ments of SCC1 and 1CC8 HNSCC cell lines used in the 
study were previously published [8]. 

Microarray data preprocessing 

To ensure that the microarray measurements from the 
HaCaT samples and the UMSCC1/1CC8 cell lines are 
comparable, we normalize all microarray measurements 
using fRMA [41]. After fRMA normalization, clustering 
reveals an apparent batch effect from processing date in 
the HaCaT expression data (Additional file 3: Figure 
S3a). We, therefore, fit a linear model with date and ex- 
perimental conditions summarized in Table 1 using the 
lmFit function in the R Bioconductor package limma 
(Linear Models for Microarray Data; [36]). Additional file 
3: Figure S3b shows that samples correctly cluster by ex- 
perimental condition after removing the modeled date 
effect using the clustering diagram generated with [42]. 
This batch corrected, normalized data will be used for 
subsequent analyses of the HaCaT cell lines. For robust- 
ness, the clustering and linear models were performed 



on all HaCaT samples, including eight samples that were 
treated with pharmacological agents. These latter samples 
were excluded from subsequent analyses due to an inabil- 
ity to infer accurate expression signatures from their small 
samples size. On the other hand, the UMSCC1 and 1CC8 
samples are processed in the same batch, requiring no 
correction after fRMA normalization. All the data are 
MIAME compliant and the raw data for all samples have 
been deposited in Gene Expression Omnibus (HaCaT data 
GSE32975; SCC1 and 1CC8 data in GSE21483). All ana- 
lyses are performed with software R using scripts provided 
as Additional file 7: Files SI, Additional file 8: File S2, 
Additional file 9: File S3, Additional file 10: File S4 and 
Additional file 11: File S5. 

Pathway and transcription factor targets 

We identify candidate transcription factor regulators for 
each probe of the Affymetrix U133 Plus 2.0 array from 
TRANSFAC using the Automated Sequence Annotation 
Pipelines (ASAP; [43]). The list used in these analyses 
was obtained and frozen on December 8, 2010 and pro- 
vided as a supplemental file containing the ASAP results 
(Additional file 7: File S2). Gene sets related to each 
pathway listed in Additional file 6: Table SI are defined as 
the targets of each transcription factor identified as down- 
stream to the pathway from Additional file 6: Table SI. 

We limit all subsequent analyses of HaCaT and 
UMSCC1/1CC8 cell line expression to only those probes 
that are annotated in TRANSFAC. We select a single 
probe for each gene to further avoid biases in gene set 
tests from using multiple probes for a single gene. Spe- 
cifically, we retain the probe for each gene with the smal- 
lest p-value resulting from comparisons of the HaCaT 
experimental conditions using t-statistics moderated with 
empirical Bayes from the limma package [36]. 

CoGAPS pattern inference and pathway analysis 

CoGAPS factors the expression matrix D into amplitude 
(A) and pattern (P) matrices with p patterns according 
to the distribution in eq. 1. The posterior distribution for 
elements of each of these matrices are computed with an 
MCMC Gibbs sampler based upon the atomic prior of 
[44] and implemented in the Bioconductor package 
CoGAPS [19]. Here, the standard deviation in eq. 1 is 
given by E,y = 0.1D,y for each gene i and sample /, based 
upon the established, microarray error-model in [45] and 
previous applications of [13,29]. 

The quality of the CoGAPS fit is assessed through the 
X fit of the posterior mean of A and P and identifiability 
of these matrices across MCMC simulations. Using these 
criterion, the optimal number of patterns p for the 
matrix factorization is the minimum number of patterns 
for which the ft 2 reaches a minimal value and the in- 
ferred patterns persist across simulations. For the HaCaT 
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expression data analyzed in this paper, the Rvalue of the 
fit begins to plateau for 6 patterns (Additional file 4: 
Figure S4), at which point multiple CoGAPS simulations 
obtain the same A and P matrices (Additional file 1: 
Figures SI and Additional file 2: S2). 

Comparing the posterior distribution of the amplitude 
matrix for a pattern (i.e., column of A) between genes in 
a gene set and in the background can indicate over- or 
under-expression of that gene set in a pattern. Based 
upon previous work from [13], we define a Z-score to 
quantify the enrichment of gene-set G with R targets in 
pattern p by 



Z <3-P = n 



Rgeg Ogp 



(2) 



where A gp and a gJl are the sample mean and variance, 
respectively, from the CoGAPS MCMC samples for gene 
(row) g and pattern (column) p amplitude. We compute 
p-values through a permutation test that compares the 
computed value of Zg p to a null distribution obtained by 
the values of the statistic in eq. 2 resulting from selecting 
random sets of R genes. In contrast to [13], we apply this 
statistic to both targets of individual transcription factors 
and targets of sets of transcription factors regulated in 
the pathways (Additional file 6: Table SI) to infer path- 
way level activity. For visualization, p-values computed 
from the statistic in eq. 2 are transformed as follows 



p=2p-l. 



(3) 



Therefore, a transformed p value of -1 indicates under- 
representation of set G and +1 indicates over-representa- 
tion. The associated pathway-level statistics in Figure 3 
represent the mean of the statistic from eq. (3) across three 
CoGAPS simulations, with error bars representing the mini- 
mum and maximum values in each of these simulations. 

Projecting the gene expression signatures in the col- 
umns of A onto additional samples can implicate the 
relative activity of inferred patterns in those samples. In 
this paper, the projection is implemented by solving the 
factorization in eq. 1 for the new data matrix where A is 
fixed as the average of the CoGAPS posterior mean for 
each of the three CoGAPS simulations performed. We 
estimate the patterns P associated with this amplitude 
matrix using the least-squares fit to the new data imple- 
mented with the lmFit function in the limma package 
[36]. Applying this projection to the original, HaCaT data 
reveals that the projection provides similar, albeit slightly 
nosier, estimates when compared to the CoGAPS poster- 
ior mean for P (Additional file 5: Figure S5). This linear 
projection is, therefore, used to project the gene signa- 
tures inferred from the HaCaT data onto gene expression 
data from the HNSCC UMSCC1 and 1CC8 cell lines. 
Differences between the values of the pattern matrix P 



inferred through CoGAPS or these subsequent projections 
are quantified using p-values from a £-test between the 
groups of experimental conditions being compared. 

Linear models and gene set pathway-level analysis 

In the results section, we compare the CoGAPS inferred 
expression signatures in the columns of the A matrix to 
expression signatures inferred by the coefficients of a lin- 
ear model. In this linear model, we use the limma pack- 
age [36] to fit the batch-corrected HaCaT data to a 
design matrix comparable to the P matrix inferred in 
CoGAPS by specifying a common intercept term, HaCaT 
cell type (HaCaT-EGFR WT , HaCaT-p65 WT , HaCaT- 
HRAS Vall2D , or HaCaT-vector), and presence or absence 
of serum. Contrasts are formulated for each of these six 
conditions to obtain coefficients for each condition com- 
parable to the columns of the CoGAPS A matrix 
(Figure 4a). The gene set statistic in Figure 4b uses the 
standard limma package geneSetTest function to test the 
probability of activity inferred from the contrast-based, 
empirical Bayes t-statistics in the pathway and transcrip- 
tion factor level targets in Additional file 6: Table SI. To 
make values comparable to the CoGAPS analysis, only 
the subset of probes analyzed in CoGAPS (i.e., in 
TRANSFAC and summarized by gene) are considered for 
the background of the gene set statistics. Similar to eq. 3, 
we rescale the resulting statistics for visualization by 



l°g 10 (/'down) if Pdovm < Pup, 

-log 10 (p up ) otherwise 



(4) 



where Pd ovm and P up are the p-values resulting from the 
geneSetTest function if the alternative hypothesis is speci- 
fied as down or up regulated, respectively. The heatmap in 
Figure 4b then rescales these statistics across columns are 
1 at the maximum value of p\ m and -1 at the minimum 
value. For further comparison to the CoGAPS results, 
Figure 4c plots the transformed p-values resulting from 
the permutation-based CoGAPS gene set test in eqs. 2 and 
3. To reflect the statistics of the linear model, the posterior 
estimate for the ratio in eq. 2 is replaced with the esti- 
mated, empirical Bayes moderated t-statistics for each of 
the six conditions specified in the linear model. 

Additional files 



Additional file 1: Figure SI. Box plot of six gene expression patterns 
inferred from the HaCaT gene expression data for each of the three 
CoGAPS simulations (pages 1-3) for the samples in Table 1. Plotted 
values are normalized to sum to one across all samples. All results for 
HaCaT-EGFR m are colored in green, HaCaT-HRAS Val,2D in red, 
HaCaT-EGFR m in grey, HaCaT-p65 m in blue, and HaCaT-vector in black. The 
y-axis is labeled according to the row of the inferred P matrix plotted in 
each panel. Specifically, (a) contains the pattern attributed to the 
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baseline HaCaT activity, (b) attributed to HaCaT-HRAS 1 , 

(c) HaCaT-vector, (d) HaCaT-EGFR™ (e) serum, and (f) HaCaT-p65 m . 

Additional file 2: Figure S2. Gene set statistics of the HaCaT expression 
data calculated from eq. 3 for each of the three CoGAPS simulations from 
blue for significantly downregulated to yellow for significantly upregulated 
according to the color bar. Columns are labeled according to the 
dominant experimental condition to which inferred CoGAPS patterns 
correspond and colored as indicated in the column color legend (red for 
the first CoGAPS simulation, green the second, and blue the third). The 
top set of statistics represents the gene set statistics computed at a 
pathway level. Colors along rows indicate the pathway for which 
activation statistics are calculated as indicated in the row color legend. 
The lower set of statistics represents the gene set statistics computed for 
the transcription factors activated by the pathway also indicated by colors 
in the rows associated with the pathway to which the transcription factor 
was assigned and indicated by the color code on the left. 

Additional file 3: Figure S3. Clustering of HaCaT expression data after 
fRMA colored by date (a). Analogous clustering after batch correction in 
(b). 

Additional file 4: Figure S4. x 2 fit from CoGAPS as a function of the 
number of patterns used in the matrix factorization for eq. 1 . 

Additional file 5: Figure S5. Heatmap comparing patterns inferred in 
CoGAPS as plotted Figure S2 (filled boxes on rows) to patterns that would 
be inferred from projecting expression patterns as described in the 
methods (open boxes on rows) colored according to the row figure legend 
As indicated in the row figure legend, patterns are plotted for each of three 
CoGAPS simulations, colored in red (simulation 1), green (simulation 2), 
and blue (simulation 3) along the rows. The bars across the columns 
indicate media and forced expression conditions, colored according to 
the figure legend. Shading of these bars indicates media (white for serum 
starved, grey for serum, green for EGF, and blue for TNFa) while borders 
indicate forced expression (grey for HaCaT™ 1 , black for HaCaT-vector, green 
for HaCaT-EGFR m , blue for HaCaT-p65 m , and red for HaCaT-HRAS Val,2D ). 

Additional file 6: Table SI. PathwayTableS1.txt: List of targets of 
transcription factors annotated to each pathway for pathway-level and 
transcription factor-level gene set analyses. 

Additional file 7: File SI . CompleteAnalysis.R: R script used for to 
generate all analyses and figures. 

Additional file 8: File S2. TF2Gene_2010.R: R script encoding a list of 
transcription factor targets in TRANSFAC from ASAP. 

Additional file 9: File S3. PathwayHeatmap.R: Support script used to 
generate pathway-level heatmaps in Figures SI and S3. 

Additional file 10: File S4. ExperimentalDescription.txt: Detailed 
annotation of experimental conditions summarized in Table 1. 

Additional file 11: File S5. SCC1 1CC8Annot.txt: Detailed annotation of 
the experimental conditions in the UMSCC1 and 1CC8 cell line expression 
datasets. 
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